This report explores the Prosper loans dataset. According to Wikipedia, Prosper Marketplace is a company in the peer-to-peer lending industry and operates Prosper.com. Borrowers request personal loans on Prosper and investors can fund them. Prosper lists, collects and distributes borrower payments and interest back to the loan investors.
The dataset contains a little over 113,000 loans with 81 variables on each loan. Click this link for list of all the variable definitions: Prosper Loan Data - Variable Definitions.
There is a surprising drop in loans in 2009. In searching the news during that time I found this Prosper News Story.
In October of 2009, the SEC forced Prosper.com, to stop brokering new loans temporarily while it determined whether Prosper’s loans should be classified as securities. After a six month quiet period Prosper was reopened to lenders and borrowers. Prosper made other changes for its business including only allowing borrowers with a credit score above 640 to request a loan. I added cohort for before 2010 and after 2010 to see other interesting similarities and differences.
I’ve added color differences in the graphs for before and after the re-open
in 2010. There is a drop at the beginning of 2013 followed by fairly
consistent growth in number of loans peaking at 14450 in Q4 of 2013.
January is the biggest month for new loans, followed by October and December.
The number of loans listed rises over the month and peaks towards the end of
the month. The highest number appear the last day of the month on the 30th.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 2500 4200 6050 7500 25000
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1000 4000 7784 9191 14000 35000
There is a difference in loan amount before 2010 and after 2010. The minimum
loan amount pre-2010 and post-2010 remained the same at $1000. However, the
maximum loan amount increased from $25000 to $35000.
The mean increased from $6050 to $9190. Post-2010 there are spikes in number
of loans at $4000, $10000, and $15000.
For both pre-2010 and post-2010 data the distribution is skewed to the right
and a small number of loans greater than $25000.
Pre-2010 the listing category wasn’t very informative. Almost 17000 of the
loans had a listing category of “Not Available”. Post-2010 since there are
only 20 loans where this information is not available.
The debt consolidation category is the largest with a little over 50% of the
loans listed.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 82.44 149.03 212.06 271.25 1130.90
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0 160.1 256.4 295.0 390.4 2251.5
The distribution for MonthlyLoanPayment is right skewed - Both pre-2010 and
post-2010 mean values are greater than the median and there is a long tail to
the right. The median payments are $100 higher (256.40) post-2010 than
pre-2010 (149.03).
Loans are for 1 year, 3 year or 5 years Terms. Almost 80% of loans are for a
3 year term.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0.0 600.0 640.0 648.2 700.0 880.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 600.0 660.0 700.0 699.3 720.0 880.0
Here are the Credit Score Ranges as defined by experian, transunion, and equifax:
Pre-2010 loans were allowed to be listed with Proper for Borrowers with a
“Bad” credit rating. In addition there were 591 pre-2010 loans listed where
the lower credit score was “Not Available”.
The median rose from 640 pre-2010 to 700 post-2010. For pre-2010 lower credit
score data is skewed slightly to the right since the mean (648.2) is greater
than the median (640). The post-2010 lower credit score data is normally
distributed - The median and mean are very close at 700 and 699 respectively.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 19.0 619.0 659.0 667.2 719.0 899.0 591
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 619.0 679.0 719.0 718.3 739.0 899.0
Pre-2010 borrowers had a upper credit score median of 659 while post-2010
borrowers had a median upper credit score of 719. The upper credit score
values look normally distributed.
The Credit Grade is the rating that was assigned to pre-2010 loans at the
time the listing went live. I could not determine via the Prosper website
search how the Credit Grade was determined. Credit Grade, from lowest-risk
to highest-risk, are labeled AA, A, B, C, D, E, HR (“High Risk”), and NC
(“No Credit”).
There were 141 loans listed for No Credit “NC” borrowers with 3508 loans
listed for “HR” - High Risk borrowers. It is surprising how many loans were
listed for borrowers in the Credit Grade groups lower than C.
Prosper Ratings, from lowest-risk to highest-risk, are labeled AA, A, B, C, D,
E, and HR (“High Risk”). Post-2010 Prosper provides a proprietary “Prosper
Rating” based on the company’s estimation of that borrower’s “estimated loss
rate.”
According to the company, the Prosper Rating is determined by two scores:
Even though Credit Rating and Prosper Rating look similar, I decided not to
combine them for my analysis since there was a decision by the company to no
longer use Credit Grade and switch to Prosper Rating.
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 4.000 6.000 5.906 8.000 11.000 73
The Prosper Score is a custom risk score for post-2010 loans built using
historical Prosper data. The score ranges from 1-11, with 11 being the best,
or lowest risk score. The Prosper score estimates the probability of a loan
going “bad,” where “bad” is the probability of going 60+ days past due within
the first twelve months from the date of loan origination. Prosper Scorecard Link
The Prosper Score looks normally distributed.
The distribution of Income Ranges is left skewed with over 50% of the loans
falling in the $25,000 - $75,000 income ranges. Prior to 2010 almost 6% of
loans didn’t include borrower income range information. This does not appear
to be the case post-2010 where loans do contain borrower income range
information.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1250 0.1700 0.1840 0.2375 0.4975
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0400 0.1364 0.1875 0.1960 0.2573 0.3600
The Borrower Rate is the Borrower’s interest rate for this loan. The median
Borrower Rate for pre-2010 loans was 17% which is lower than the median
18.75% post-2010. The max Borrower Rate for pre-2010 loans was 50% vs 36%
post-2010. The interquartile range pre-2010 is 11.25% where post-2010 the
interquartile range is 12.09%.
It surprised me that Borrowers were paying a higher rate for a Prosper loan
post-2010 compared to pre-2010 since interests rates for things like mortgages
were lower. Link to Historical Mortgage Rates.
I defined a factor variable for loans in “GoodStanding”" vs “InDefault” to
group loans for further analysis. I grouped loans as “InDefault” if the loan
status might affect a borrowers credit rating.
Loans in the “InDefault” group:
Pre-2010 the percentage of loans “In Default” was about 35%. Post-2010 about
10% of loans listed were “InDefault”
The prosper loans dataset contains 113937 rows and 81 variables. There are 70
numeric variables and 26 factor variables. The dataset met the criteria for
tidiness as defined
by Wikipedia here.
The main areas I am focused on are how the change in Prosper Loans business
affected their two customer groups - borrowers and lenders.
Lender - What features are of interest and may impact if the loan will get
paid back?
LoanOriginalAmount
Yes I created some new variables namely:
Each of these variables helped with creating graphs with a different or
labeled groups.
I performed some operations to adjust the data mainly for correctly ordering
factor variables.
This chart shows the relationship between Credit Grade and Upper Credit Score.
There are a surprising number of loans for borrowers either “No Credit”" or
are considered “High Risk”. It also seems odd that all the borrowers in the
“High Risk” are not all in the same category namely “Bad”. That would seem
confusing to a lender to have this group split when the E and D Credit Grades
are much more uniform.
This chart shows the relationship between Prosper Rating and Upper
Credit Score. Post-2010 there are no borrowers with a “Bad” credit score. It
also seems odd that there are borrowers with a “Poor” through “Exception”
Credit Rating in the High Risk category. That would seem very confusing to a
lender to have this mix of credit ratings across the prosper rating groups. It
is very odd to have a “High Risk”" loan for a borrower with an “Exceptional”
credit rating.
Even with the mix of Credit Ratings within a Prosper Rating, there is a clear
increase in better credit scores as the prosper rating values increase.
These two graphs show the relationship between Borrower Rating and the
pre-2010 Credit Grade and post-2010 Prosper Rating. The Borrower Rate for
Credit Grades “NC” through “E” were between 15% - 28%. For Credit Grades D,
C, B, A, and AA there is downward stair step.
The Borrower Rate for Prosper Ratings post-2010 “HR” have a median value above
30% which is higher. There is a clear downward stair step towards “AA” with
the Borrower Rate median and interquartile range below 10%.
These three graphs compare the loan status to each of the three variables
measuring credit quality and risk. The Credit Grade graph shows that the 35%
of defaults in loans pre-2010 occured across all the Credit Grades. I was
surprised to see about 2% of loans in AA in default and made me question the
validity of the rating. As I mentioned earlier in the report, I could not find
details on how the Credit Grade was determined.
The Prosper Rating graph shows a decrease in loans “InDefault”" as the Prosper
Rating increases. This is more in line with what I would expect for a borrower
loan rating.
The Prosper Score is an internal scorecard and estimates the probability of a
loan going “bad,” but only looks at the possibility of a loan going bad within
the first year of the loan. In the univariate analysis, we saw that almost
80% of the loans are for three years. As we can see in the chart above, loans
with a higher prosper score six or above also had about a 1% default rate. If
prosper updated thier model to look at longer timelines, this score could be
more accurate and reduce the post-2010 10% default rate.
I used a scatter plot with an alpha of 1/4 to plot Loan Original Amounts vs the Credit Grade and Prosper Rating. The graph of Credit Grade shows a larger number of defaults for loan amounts greater than $5000 for Credit Grades B and
lower. The higher loan amounts even across credit grades are more likely to
be “InDefault” as is shown by the predominantly blue color.
The Prosper Rating graph shows the much lower rates of default as we saw
earlier. It also shows lower Loan Amounts for borrowers in the High Risk
group post-2010 compared to pre-2010. Post-2010 shows much more restraint and
consistency when loaning money to borrowers with lower ratings.
For both plots, I only looked at Loans for $10000 to be able to compare Monthly
Loan Payments pre-2010 and post-2010. I took the sqrt of the Monthly Loan
Payment to adjust for the skewness in the data. The plot for Monthly Loan
Payments post-2010 shows that interquartile range for payments across all
Prosper Ratings are below $500. For pre-2010 loans there it was interesting
to see that interquartile range that are in default were all above $500. We
can see from these charts that a monthly loan payment below $500 has an impact
on keeping the loan in good standing.
investigation. How did the feature(s) of interest vary with other features in
the dataset? I summarized my findings after each plot or group of plot above.
The Monthly Loan Payment analysis was interesting. Having a payment amount that the borrower can make each month has a big influence on keeping the loan in good standing.
The strongest relationship I found was BorrowerRate and ProsperRating. It was interesting because if a borrower wanted to reduce the rate they paid it wasn’t as transparent to figure out what actions they could take. The Prosper Rating is composed of the Credit Score and the Prosper Score. If they want to increase thier credit score they can call the credit service and see details. For the Prosper Rating or Prosper Score it isn’t as clear how those are calculated and they affect the rate the borrower pays.
##
## Calls:
## Model 1 : lm(formula = BorrowerRate ~ Term, data = subset_vars)
## Model 2 : lm(formula = BorrowerRate ~ Term + CreditScoreRangeUpper, data = subset_vars)
## Model 3 : lm(formula = BorrowerRate ~ Term + CreditScoreRangeUpper + DebtToIncomeRatio,
## data = subset_vars)
## Model 4 : lm(formula = BorrowerRate ~ Term + CreditScoreRangeUpper + DebtToIncomeRatio +
## ProsperScore, data = subset_vars)
## Model 5 : lm(formula = BorrowerRate ~ Term + CreditScoreRangeUpper + DebtToIncomeRatio +
## ProsperScore, data = subset_vars)
## Model 6 : lm(formula = BorrowerRate ~ Term + CreditScoreRangeUpper + DebtToIncomeRatio +
## ProsperScore + ProsperRating..numeric., data = subset_vars)
##
## ================================================================================================================
## Model 1 Model 2 Model 3 Model 4 Model 5 Model 6
## ----------------------------------------------------------------------------------------------------------------
## (Intercept) 0.196*** 0.764*** 0.766*** 0.635*** 0.635*** 0.328***
## (0.001) (0.003) (0.004) (0.003) (0.003) (0.001)
## Term -0.000 0.000*** 0.000*** 0.000*** 0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000) (0.000)
## CreditScoreRangeUpper -0.001*** -0.001*** -0.000*** -0.000*** 0.000***
## (0.000) (0.000) (0.000) (0.000) (0.000)
## DebtToIncomeRatio 0.028*** 0.010*** 0.010*** -0.001**
## (0.001) (0.001) (0.001) (0.000)
## ProsperScore -0.017*** -0.017*** 0.001***
## (0.000) (0.000) (0.000)
## ProsperRating..numeric. -0.045***
## (0.000)
## ----------------------------------------------------------------------------------------------------------------
## sigma 0.074 0.064 0.062 0.051 0.051 0.021
## R-squared 0.000 0.253 0.289 0.526 0.526 0.920
## F 0.015 14032.034 10290.557 20992.080 20992.080 174124.310
## p 0.901 0.000 0.000 0.000 0.000 0.000
## N 82877 82877 75772 75772 75772 75772
## ================================================================================================================
I focused my analysis in this section on looking at how to increase the ability for borrowers to keep thier loan in good standing.
The first two plots looked at MontlyLoanPayment vs Term by LoanStatus for borrowers with a ProsperScore below 6. I thought about looking at this as if I were a data analyst for Prosper - There might be an opportunity for Prosper to increase the Term or provide more flexibility on the Term (4 year loans?) to reduce the monthly payment amount to < $250 and increase the ability for the borrower to make the lower payments.
The second two plots looked at MonthlyLoanPayment vs Loan Original Amount by Loan Status for borrowers with a ProsperCore below 6. Again I thought about looking at this as if I were a data analyst for Prosper - There might be an opportunity to refine the Loan Consolidation model to keep the payment amounts lower. The red line shows loans with the lowest prosper score and therefore most likely to go bad but shows a trend upwards for the payment. If the payments were lower perhaps the loans would be less risky.
strengths and limitations of your model.
I created a model for BorrowerRate to better understand how it was calculated. I looked at Term, CreditScoreRangeUpper, DebtToIncomeRatio, ProsperScore, and ProsperRating..numeric. The correlation between BorrowerRate and ProsperRating..numeric is strong at .92 R-squared. It was suprising the correlation for Credit Score was so low at .29. It would be a challenge as a borrower to figure out how to get a better rate on the Prosper platform without increased transparency into how those metrics are calculated.
The models I put together are simplistic at this point compared to the number of variables associated with peer to peer lending but still insightful. After I get more experience with model development beyond linear regression I would like to revisit this analysis.
I chose this plot because it showed a difference between the Credit Score and the Prosper Rating. It was suprising that a “Poor” credit score would end appear in the higher Prosper Ratings.
For this plot I looked at loans before 2010. I took the sqrt of the Monthly Loan Payment to adjust for the skewness in the data. It was interesting to see the medians montly payment for loans in good standing and loans in default drift as the monthly payment rose. From this chart your can see that a lower monthly payment has an impact on keeping the loan in good standing.
For this plot I looked at loans originated after 2010. The graph shows how currently the monthly payment is correlated to the Loan Amount rather than how if a borrower can consistently make monthly payments. (The correlation between Montly Loan Payment and Loan Original Amount when calculated with cor.test was .91.) As loan amounts increase, monthly payments increase.
Prosper already made an adjustment in 2010 to keep payments lower. It would be intesting for them to model if adding more options for Term and Monthly payment amounts would futher reduce the post-2010 10% default rate.
========================================================
For this exploratory data analysis I decided to take on one of the larger data sets. With 81 variables to explore, it was easy to get off track and look into each of the variables. It took a bit to figure out which variables were important and I was afraid to miss an insight. I did alot of plots as a result and many of them are not included because they not that useful. I decided to keep my focus on questions I thought Prosper’s customers would care about.
I was very surprised at the Peer to Peer business environment pre-2010 that led to action from the SEC. I found it incredibly helpful to have domain knowledge for Peer to Peer Lending and overview of the business - both provided much needed context to what I was seeing in the data.
The interesting models I would like to predict are more complex and outside the scope of EDA. For example the Prosper Score is an internal scorecard and estimates the probability of a loan going “bad,” where “bad” is the probability of going 60+ days past due within the first twelve months from the date of loan origination. This isn’t helpful for loans that are 3 year or 5 year in term. The lenders would probably like a better score to predict if it is likely that a loan will get paid off. It would also be good to know if focusing on lower monthly payments would help keep the loans in good standing. The loans are not secured, Prosper has discontinued use of a secondary market, and the lender is bearing the risk.
What did go well is my EDA skills have improved. I also have a much better understanding of when to apply a transform to a variable and what plots to use and in what circumstance. The example EDA projects were inspiring and I was able to use a few tips. I am excited to take on machine learning. I would like to develop a model that attempt to forecast what type of loans are likely to stay in good standing as a future project.